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Abstract. In multi-parameter models, reference priors typically depend on the 
parameter or quantity of interest, and it is well known that this is necessary 
to produce objective posterior distributions with optimal properties. There are, 
however, many situations where one is simultaneously interested in all the param¬ 
eters of the model or, more realistically, in functions of them that include aspects 
such as prediction, and it would then be useful to have a single objective prior 
that could safely be used to produce reasonable posterior inferences for all the 
quantities of interest. In this paper, we consider three methods for selecting a sin¬ 
gle objective prior and study, in a variety of problems including the multinomial 
problem, whether or not the resulting prior is a reasonable overall prior. 

Keywords: Joint Reference Prior, Logarithmic Divergence, Multinomial Model, 
Objective Priors, Reference Analysis. 


1 Introduction 

1.1 The problem 

Objective Bayesian methods, where the formal prior distribution is derived from the 
assumed model rather than assessed from expert opinions, have a long history (see 
e.g.^ Bernardo and Smith, 1994; Kass and Wasserman, 1996, and references therein). 
Reference priors (Bernardo, 1979, 2005; Berger and Bernardo, 1989, 1992a,b, Berger, 
Bernardo and Sun, 2009, 2012) are a popular choice of objective prior. Other interesting 
developments involving objective priors include Clarke and Barron (1994), Clarke and 
Yuan (2004), Consonni, Veronese and Gutierrez-Peha (2004), De Santis et al. (2001), De 
Santis (2006), Datta and Ghosh (1995a; 1995b), Datta and Ghosh (1996), Datta et al. 
(2000), Ghosh (2011), Ghosh, Mergel and Liu (2011), Ghosh and Ramamoorthi (2003), 
Liseo (1993), Liseo and Loperfido (2006), Sivaganesan (1994), Sivaganesan, Laud and 
Mueller (2011) and Walker and Gutierrez-Peha (2011). 

In single parameter problems, the reference prior is uniquely defined and is invari¬ 
ant under reparameterization. However, in multiparameter models, the reference prior 
depends on the quantity of interest, e.^., the parameter concerning which inference is 
being performed. Thus, if data x are assumed to have been generated from pix \ uj), 
with u> G C and one is interested in 0{u>) G 0 C 5ft, the reference prior TTg{uj), 
will typically depend on ft; the posterior distribution, '!rg{u} \ x) oc p{x \ uj) 7re(a;), thus 
also depends on ft, and inference for ft is performed using the corresponding marginal 
reference posterior for ft(a;), denoted 'ng{9\x). The dependence of the reference prior 
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on the quantity of interest has proved necessary to obtain objective posteriors with ap¬ 
propriate properties - in particular, to have good frequentist coverageproperties (when 
attainable) and to avoid marginalization paradoxes and strong inconsistencies. 

There are however many situations where one is simultaneously interested in all the 
parameters of the model or perhaps in several functions of them. Also, in prediction and 
decision analysis, parameters are not themselves the object of direct interest and yet an 
overall prior is needed to carry out the analysis. Another situation in which having an 
overall prior would be beneficial is when a user is interested in a non-standard quantity 
of interest (e.^., a non-standard function of the model parameters), and is not willing or 
able to formally derive the reference prior for this quantity of interest. Computation can 
also be a consideration; having to separately do Bayesian computations with a different 
reference prior for each parameter can be onerous. Finally, when dealing with non¬ 
specialists it may be best pedagogically to just present them with one overall objective 
prior, rather than attempting to explain the technical reasons for preferring different 
reference priors for different quantities of interest. 

To proceed, let 6 = 6{u}) = {6i{uj),... be the set of to > 1 functions of 

interest. Our goal is to find a joint prior 7 r(a;) whose corresponding marginal posteriors, 
{ 7 r( 0 i I a:)}™ 1 , are sensible from a reference prior perspective. This is not a well-defined 
goal, and so we will explore various possible approaches to the problem. 

Example 1.1. Multinomial Example: Suppose x = (ii,..., Xm) is multinomial 
Mu(a; I n; 6 * 1 ,..., 9m), where Berger and Bernardo 

(1992b), the reference prior is derived when the parameter 9i is of interest, and this 
is a different prior for each 9i, as given in the paper. The reference prior for 9i results 
in a Beta reference marginal posterior Be{9i \ Xi + ^,n — Xi + ^). We would like to 
identify a single joint prior for 6 whose marginal posteriors could be expected to 
be close to each of these reference marginal posteriors, in some average sense. 

1.2 Background 

It is useful to begin by recalling earlier efforts at obtaining an overall reference prior. 
There have certainly been analyses that can be interpreted as informal efforts at ob¬ 
taining an overall reference prior. One example is given in Berger and Sun (2008) for 
the five parameter bivariate normal model. Priors for all the quantities of interest that 
had previously been considered for the bivariate normal model (21 in all) were studied 
from a variety of perspectives. One such perspective was that of finding a good over¬ 
all prior, defined as one which yielded reasonable frequentist coverage properties when 
used for at least the most important quantities of interest. The conclusion was that the 
prior 7r°(/ii,/i 2 ,CTi,(T 2 7 p) = l/[< 7 i< 72 (l — P^)], where the pi are the means, the ai are the 
standard deviations, and p is the correlation in the bivariate normal model, was a good 
choice for the overall prior. 

We now turn to some of the more formal efforts to create an overall objective prior. 

Invariance-based priors 

If p{x I uj) has a group invariance structure, then the recommended objective prior is 
typically the right-Haar prior. Often this will work well for all parameters that define the 
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invariance structure. For instance, if the sampling model is N(a;i | the right-Haar 

prior is T:[gL,a) = and this is fine for either p, or cr (yielding the usual objective 
posteriors). Such a nice situation does not always obtain, however. 

Example 1.2. Bivariate Normal Distribution: The right-Haar prior is not 
unique for the bivariate normal problem. For instance, two possible right-Haar 
priors are 7ri(/ii,/r 2 , cri, cr 2 , p) = l/[crf (1 - p^)] and 7 r 2 (pi, p 2 , o"!, o- 2 , p) = l/[o-i(l- 
p^)\. In Berger and Sun (2008) it is shown that is fine for Oi and p, but leads 
to problematical posteriors for the other mean and standard deviation. 

The situation can be even worse if the right-Haar prior is used for other parameters 
that can be considered. 

Example 1.3. Multi-Normal Means: Let Xi be independent normal with mean 
Hi and variance 1, for i = 1, ■ ■ ■ ,m. The right-Haar prior for p = (pi,..., Hm) is 
just a constant, which is fine for each of the individual normal means, resulting 
in a sensible N(pi | 1) posterior for each individual Hi- But this prior is bad for 

overall quantities such as 0 = d-h discussed in Stein (1959) 

and Bernardo and Smith (1994, p. 365). For instance, the resulting posterior mean 
of 0 is [1 -f ^ which is inconsistent as m —>■ oo (assuming ^ Pi 

a limit); indeed, it is easy to show that then [1 + ^ Sti [^t + 2 ], where 

9t is the true value of 9. Furthermore, the posterior distribution of 9 concentrates 
sharply around this incorrect value. 

Constant and vague proper priors 

Laplace (1812) advocated use of a constant prior as the overall objective prior and 
this approach, eventually named inverse probability, dominated statistical practice for 
over 100 years. But the problems of a constant prior are well-documented, including the 
following: 

(i) Lack of invariance to transformation, the main criticism directed at Laplace’s 
approach. 

(ii) Frequent posterior impropriety. 

(iii) Possible terrible performance, as in the earlier multi-normal mean example. 

Vague proper priors (such as a constant prior over a large compact set) are per¬ 
ceived by many as being adequate as an overall objective prior, but they too have well- 
understood problems. Indeed, they are, at best, equivalent to use of a constant prior, 
and so inherit most of the flaws of a constant prior. In the multi-normal mean example, 
for instance, use of N(/ri | 0,1000) vague proper priors results in a posterior mean for 9 
that is virtually identical to the inconsistent posterior mean from the constant prior. 

There is a common misperception that vague proper priors are safer than a constant 
prior, since a proper posterior is guaranteed with a vague proper prior but not for a 
constant prior. But this actually makes vague proper priors more dangerous than a 
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constant prior. When the constant prior results in an improper posterior distribution, 
the vague proper prior will yield an essentially arbitrary posterior, depending on the 
degree of vagueness that is chosen for the prior. And to detect that the answer is 
arbitrary, one has to conduct a sensitivity study concerning the degree of vagueness, 
something that can be difficult in complex problems when several or high-dimensional 
vague proper priors are used. With the constant prior on the other hand, the impropriety 
of the posterior will usually show up in the computation—the Markov Chain Monte 
Carlo (MCMC) will not converge—and hence can be recognized. 


Jeffreys-rule prior 

The Jeffreys-rule prior (Jeffreys, 1946, 1961) is the same for all parameters in a model, 
and is, hence, an obvious candidate for an overall prior. If the data model density is 
p{x I 6 ) the Jeffeys-rule prior for the unknown 9 = {0i,..., 6m} has the form 

7r(0i,...,0^) = |/(0)|i/2, 


where I { 9 ) is the m x m Fisher information matrix with (i,j) element 


H^)ij — Ea; I 0 


dOidOj 


\ogp{x I 9 ) 


This is the optimal objective prior (from many perspectives) for regular one-parameter 
models, but has problems for multi-parameter models. For instance, the right-Haar prior 
in the earlier multi-normal mean problem is also the Jeffreys-rule prior there, and was 
seen to result in an inconsistent estimator of 6. Even for the basic N(a;i | p, a) model, the 
Jeffreys-rule prior is 7r(/i,cr) = l/cr^, which results in posterior inferences for p and a 
that have the wrong ‘degrees of freedom.’ 

For the bivariate normal example, the Jeffreys-rule prior is l/[tT^CT|(l — p^)^]; this 
yields the natural marginal posteriors for the means and standard deviations, but results 
in quite inferior objective posteriors for p and various derived parameters, as shown in 
Berger and Sun (2008). More, generally, the Jeffreys-rule prior for a covariance matrix 
is studied in Yang and Berger (1994), and shown to yield a decidedly inferior posterior. 

There have been efforts to improve upon the Jeffreys-rule prior, such as consideration 
of the “independence Jeffreys-rule prior,” but a general alternative definition has not 
resulted. 

Finally, consider the following well-known example, which suggests problems with the 
Jeffreys-rule prior even when it is proper. 

Example 1.4. Multinomial Distribution (continued): Consider the multino¬ 
mial example where the sample size n is small relative to the number of classes m; 
thus we have a large sparse table. The Jeffreys-rule prior is the proper prior, 
Tr{0i,... ,9m) oc : ^ut is not a good candidate for the overall prior. 

For instance, suppose n = 3 and m = 1000, with X 2 iQ = 2, x^-jq = 1; and all the 
other Xi = 0. The posterior means resulting from use of the Jeffreys-rule prior are 

Xi 1/2 Xi -\- 1/2 Xi -\- 1/2 


503 
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so E[024 o I x] = 1^, E[0876 I E[0i | x] = ^ otherwise. So, cells 240 and 876 

only have total posterior probability of ^ = 0.008 even though all 3 observations 
are in these cells. The problem is that the Jeffreys-rule prior effectively added 1/2 
to the 998 zero cells, making them more important than the cells with data! That 
the Jeffreys-rule prior can encode much more information than is contained in the 
data is hardly desirable for an objective analysis. 

An alternative overall prior that is sometimes considered is the uniform prior on the 
simplex, but this is even worse than the Jeffreys prior, adding 1 to each cell. The 
prior that adds 0 to each cell is ]/[j but this results in an improper posterior if 
any cell has a zero entry, a virtual certainty for very large tables. 

We actually know of no multivariable example in which we would recommend the 
Jeffreys-rule prior. In higher dimensions, the prior always seems to be either ‘too dif¬ 
fuse’ as in the multinormal means example, or ‘too concentrated’ as in the multinomial 
example. 

Prior averaging approach 

Starting with a collection of reference (or other) priors {71^(0), i = 1,, m} for differing 
parameters or quantities of interest, a rather natural approach is to use an average of 
the priors. Two natural averages to consider are the arithmetic mean 



and the geometric mean 



While the arithmetic average might seem most natural, arising from the hierarchical 
reasoning of assigning each probability 1/m of being correct, geometric averaging 
arises naturally in the definition of reference priors (Berger, Bernardo and Sun, 2009), 
and also is the optimal prior if one is trying to choose a single prior to minimize the 
average of the Kullback-Leibler (KL) divergences of the prior from the tt^’s (a fact 
of which we were reminded by Gauri Datta). Furthermore, the weights in arithmetic 
averaging of improper priors are rather arbitrary because the priors have no normalizing 
constants, whereas geometric averaging is unaffected by normalizing constants. 

Example 1.5. Bivariate Normal Distribution (continued): Faced with the 
two right-Haar priors in this problem, 

Tri(fJ.i,/r2,oi,a2,p) = - p^)~\ P2, <^2, p) = - P^)~\ 

the two average priors are 


7r^(Mi,M2,o-i,cr2,p) 


1 


1 


( 1 ) 


2cr2(l _ p2) + 2al(l - p2) 


7r‘^(/4l,M2,0-l,0-2,p) 


1 


( 2 ) 


ai(72(l - p2) ' 
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Interestingly, Sun and Berger (2007) show that is a worse objective prior than 
either right-Haar prior alone, while is the overall recommended objective prior. 

One problem with the averaging approach is that each of the reference priors can 
depend on all of the other parameters, and not just the parameter of interest, 0i, for 
which it was created. 

Example 1.6. Multinomial Example (continued): The reference prior derived 
when the parameter of interest is 0i actually depends on the sequential ordering 
chosen for all the parameters (e.g. { 0 i, 0 i, 02 , ■ ■ ■ ,0i-i,0i+i, ■ ■ ■ ,0m}) hr the ref¬ 
erence prior derivation; there are thus (m — 1)! different reference priors for each 
parameter of interest. Each of these reference priors will result in the same marginal 
reference posterior for 0^, 


ng.{0 ,1 x) = Be{0i\x, + ^,n-Xi + \), 

but the full reference prior and the full posterior, ttq.{0\x), do depend on the 
ordering of the other parameters. There are thus a total of m! such full reference 
priors to be averaged, leading to an often-prohibitive computation. 

In general, the quality of reference priors as overall priors is unclear, so there is no 
obvious sense in which an average of them will make a good overall reference prior. 
The prior averaging approach is thus best viewed as a method of generating interesting 
possible priors for further study, and so will not be considered further herein. 

1.3 Three approaches to construction of the overall prior 

Common reference prior 

If the reference prior that is computed for any parameter of the model (when declared 
to be the parameter of interest) is the same, then this common reference prior is the 
natural choice for the overall prior. This is illustrated extensively in Section 2; indeed, 
the section attempts to catalogue the situations in which this is known to be the case, 
so that these are the situations with a ready-made overall prior. 

Reference distance approach 

In this approach, one seeks a prior that will yield marginal posteriors, for each 0i of 
interest, that are close to the set of reference posteriors {■K{0i \ x)}^i (yielded by the 
set of reference priors {Tre^(tij)}(Ti), in an average sense over both posteriors and data 
X G X. 

Example 1.7. Multinomial Example (continued): In Example 1.4 consider, 
as an overall prior, the Dirichlet Di(0 | a,..., a) distribution, having density propor¬ 
tional to Hi leading to Be(0i | Xi+a, n—Xi+{m—l)a) as the marginal posterior 
for 0i. In Section 3.2, we will study which choice of a yields marginal posteriors that 
are as close as possible to the reference marginal posteriors Be(0i \ xi + ^,n — Xi + ^), 
arising when 0i is the parameter of interest. Roughly, the recommended choice is 
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a = 1/m, resulting in the overall prior 7r°(0i, ..., 0^) oc Oi^i Note that 

this distribution adds only 1/m = 0.001 to each cell in the earlier example, so that 

I , Xt + l/m Xi + l/m a;*+0.001 

Thus E[024 o I x] Ri 0.5, E[0876 I x] Ri 0.25, and B[9i \ x] r; otherwise, all sensible 
results. 


Hierarchical approach 

Utilize hierarchical modeling to transfer the reference prior problem to a ‘higher level’ 
of the model (following the advice of I. J. Good). In this approach one 

(i) Chooses a class of proper priors 'k{ 6 \ a) reflecting the desired structure of the 
problem. 

(ii) Forms the marginal likelihood p{x | a) = / p{x \ a)'K{6 \ a) dO. 

(iii) Finds the reference prior, 7r^(a), for a in this marginal model. 

Thus the overall prior becomes 

Tr°{9) = (■n{0 \a)'K^{a) da , 


although computation is typically easier by utilizing both 6 and a in the computation 
rather than formally integrating out a. 

Example 1.8. Multinomial (continued) The Dirichlet Di(0 | a,..., a) class of 
priors is natural here, reflecting the desire to treat all the 6i similarly. We thus need 
only to find the reference prior for a in the marginal model. 


p{x I a) 




r(m a) 


^(ma) + a) ^ 

r(a)"* r(n + ma) 


Wor^do 


( 3 ) 


The reference prior for Tr^{a) would just be the Jeffreys-rule prior for this marginal 
model; this is computed in Section 4. The implied prior for 9 is, of course 


— / Di(0 I a) 7r^(a) do . 


Interestingly, 7r^(a) turns out to be a proper prior, necessary because the marginal 
likelihood is bounded away from zero as a —?► oo. 

As computations in this hierarchical setting are more complex, one might alter¬ 
natively simply choose the Type-II maximum likelihood estimate— i.e., the value 
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of a that maximizes (3) —at least when m is large enough so that the empiri¬ 
cal Bayes procedure can be expected to be close to the full Bayes procedure. 
For the data given in the earlier example (one cell having two counts, another 
one count, and the rest zero counts), this marginal likelihood is proportional to 
[a(a -|- l)]/[(ma -I- l)(ma -|- 2)], which is maximized at roughly a = y/^fm. In 
Section 4 we will see that it is actually considerably better to maximize the refer¬ 
ence posterior for a, namely 7r^(a | x) oc p{x \ a) 7r^(a), as it can be seen that the 
marginal likelihood does not go to zero as a —>■ oo and the mode may not even 
exist. 

1.4 Outline of the paper 

Section 2 presents known situations in which the reference priors for any parameter (of 
interest) in the model are identical. This section is thus the beginnings of a catalogue 
of good overall objective priors. Section 3 formalizes the reference distance approach 
and applies it to two models—the multinomial model and the normal model where 
the coefficient of variation is also a parameter of interest. In Section 4 we consider 
the hierarchical prior modeling approach, applying it to three models—the multinomial 
model, a hypergeometric model, and the multinormal model—and misapplying it to the 
bivariate normal model. Section 5 presents conclusions. 

2 Common reference prior for all parameters 

In this section we discuss situations where the reference prior is unique, in the sense 
that it is the same no matter which of the specified model parameters is taken to 
be of interest and which of the possible possible parameter orderings is used in the 
derivation. (In general, a reference prior will depend on the parameter ordering used in 
its derivation.) This unique reference prior is typically an excellent choice for the overall 
prior. 


2.1 Structured diagonal Fisher information matrix 

Consider a parametric family p(x \ 6) with unknown parameter 6 = - ■ ■ , Ok)- For 

any parameter 0^, let 6^i = (0 i,--- , 0i_i, 0 ^+ 1 , • • • ,0^) denote the parameters other 
than Oi- The following theorem encompasses a number of important situations in which 


there is a common reference prior for all parameters. 

Theorem 2.1. Suppose that the Fisher information matrix of 6 is of the form, 

I{e) = diag{h{eMe_^),h{02)92{9-2),--- Jk{0m)gk{e-k)), (4) 

where ft is a positive function of Oi and gi is a positive function of d-i, for i = 1, - ■ ■ ,k. 
Then the one-at-a-time reference prior, for any chosen parameter of interest and any 
ordering of the nuisance parameters in the derivation, is given by 

CX Vfli9l)f2{02)---fk{0k). (5) 

Proof. The result follows from Datta and Ghosh (1996). □ 
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This prior is also what was called the independent reference prior in Sun and Berger 
(1998), and is the most natural definition of an independence Jeffreys prior under con¬ 
dition (4). Note that being the common reference prior for all of the original parameters 
of interest in the model does not guarantee that tt^ will be the reference prior for ev¬ 
ery potential parameter of interest (see Section 3.1) but, for the scenarios in which an 
overall prior is desired, this unique reference prior for all natural parameters is arguably 
optimal. 

A simple case in which (4) is satisfied is when the density is of the form 

k 

p{x\e) = Y[pi{x,\ei), (6) 

i=l 

with X decomposable as a; = {xi,. .. ,X}f). In this case the (common to all parameters) 
reference prior is simply the product of the reference priors for each of the separate 
models pi{xi \0i)-, this is also the Jeffreys-rule prior. 

Bivariate binomial distribution 

Crowder and Sweeting (1989) consider the following bivariate binomial distribution, 
whose probability density is given by 

p{r, s\ 0 i,O 2 )= eiil - 0^r-^ Q 0^(1 - 02 )’'-^ 

where 0 < 0 i ,02 < 1, and s and r are nonnegative integers satisfying 0 < s < r < n. 
The Fisher information matrix for (0i,02) is given by 

I{ 0 i, 02 ) = n diag[{0i(l - 0i)}-i, 0 i{ 02 (l - 02 )}-^], 

which is of the form (4). (Note that this density is not of the form (6).) Hence the 
reference prior, when either 0i or 02 are the parameter of interest, is 

'^^{01.02) « {0l(l - 0l)02(l - 02)}-^ , 

i.e., independent Beta, Be(0i | 1/2,1/2) distributions for 0i and 02; this reference prior 
was first formally derived for this model by Poison and Wasserman (1990). This is thus 
the overall recommended prior for this model. 

Multinomial distribution for directional data 

While we have already seen that determining an overall reference prior for the multino¬ 
mial distribution is challenging, there is a special case of the distribution where doing 
so is possible. This happens when the cells are ordered or directional. For example, 
the cells could be grades for a class such as A, B, C, D, and F; outcomes from an 
attitude survey such as strongly agree, agree, neutral, disagree, and strongly disagree; 
or discrete survival times. Following Example 1.1 (multinomial example), with this cell 
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ordering, there is a natural reparameterization of the multinomial probabilities into the 
conditional probabilities 




for j 


1, • • • , m — 1. 


(7) 


Here is the conditional probability of an observation being in cell j given that the 
observation is in cells j to m. The Fisher information matrix of (^i, • • • , ^m-i) is 


I*{^ 1,^2 ■ ■ ■ ,'Cm-i) = n diag(? 7 i,- • • , 77 ^- 1 ), 


( 8 ) 


where 

for J = 1 , • • • , m — 1 . Clearly ( 8 ) is of the form (4), from which it immediately follows 
that the one-at-a-time reference prior for any of the parameters (^ 1,^2 • • • jCm-i) (and 
any ordering of them in the derivation) is the product of independent Beta (1/2,1/2) 
distributions for the for j = 1,... m — 1. This is the same as Berger and Bernardo’s 
(1992b) reference prior for this specific ordering of cells. 

A two-parameter exponential family 

Bar-Lev and Reiser (1982) considered the following two-parameter exponential family 
density: 

p(x| 0 i, 6 > 2 ) = a(a;)exp{ 6 »iC/i(x) - 6 »iG 2 ( 6 ' 2 )C/ 2 (x) - 7 /( 01 , 6 » 2 )}, (9) 

where the Ui{-) are to be specified, 0i < 0, 02 = E{[/ 2 (X) | ( 0 i, 02 )}, the Gi(-)’s, are in¬ 
finitely differentiable functions with G" > 0, and 7/(0i, 02 ) = — 0 i{ 02 G' 2 ( 02 ) — G 2 ( 02 )} + 
Gi( 02 ). This is a large class of distributions, which includes, for suitable choices of Gi, 
G 2 , Ui and U2, many popular statistical models such as the normal, inverse normal, 
gamma, and inverse gamma. Table 1, reproduced from Sun (1994), indicates how each 
distribution arises. 


Table 1. Special cases of Bar-Lev and Reiser’s (1982) two parameter exponential family, 
where /i(0i) = -0i -I- 0i log(-0i) -I- log(r(-0i)). 



Gi(0i) 

G2(02) 

Ui[x) 

U2{x) 

9i 

92 

Normal (/r, a) 

-ilog(-20i) 



X 

-l/(2a^) 


Inverse Gaussian 

-ilog(-20i) 

1192 

1/x 

X 

-af2 


Gamma 

K9^) 

- log 02 

— log a: 

X 

—a 


Inverse Gamma 

h{9^) 

- log 02 

log a: 

\/x 

—a 



The Fisher information matrix of (0i,02) based on (9) is 


/( 01 , 02 ) 


G'/(0i) 

0 


0 

0iG"(02) 
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which is of the form (4). Thus, when either 0i or 02 is the parameter of interest, the 
one-at-a-time reference prior (first shown in Sun and Ye (1996)) is 

^^{01,02) = ^lG'i{ei)G'i{e2). ( 10 ) 

For the important speciai case of the Inverse Gaussian density, 

p(x I a,-0) = (a/27rx^)^/^ exp| —-Q;a;(l/a; — V’)^|, a; > 0 ( 11 ) 

where a > 0, > 0, the common reference prior (and overaii recommended prior) is 

7r^(a,V') oc . (12) 

ay/ip 

The resuiting marginai posteriors of a and tp can be found in Sun and Ye (1996). 

For the important speciai case of the Gamma {a, jj) density, 

p{x\a,gL) = exp(—aa;//i)/{F(a)/x“}, (13) 

the common reference prior (and overaii recommended prior) is 

'K^{a,p) ^ , (14) 

Va/r 

where ^(a) = (9^/9a^) iog{F(a)} is the poiygamma function. The resuiting marginai 
posteriors of a and p can be found in Sun and Ye (1996). 


A stress-strength model 

Gonsider the foiiowing stress-strength system, where Y, the strength of the system, is 
subject to stress X. The system faiis at any moment the appiied stress (or ioad) is 
greater than the strength (or resistance). The reiiabiiity of the system is then given by 

0 = P{X<Y). (15) 

An important instance of this situation was described in Enis and Geisser (1971), 
where Xi, • • • , Xm, and Yi, • • • ,Yn are independent random sampies from exponentiai 
distributions with unknown means rji and 772 , in which case 

d = Vi/{Vi+V2)- (16) 


As the data density is of the form (6), the (common to aii parameters) reference prior 
is easiiy seen to be 'K^{rji,r] 2 ) = 1 /( 771772 ), which is also the Jeffreys prior as noted in 
Enis and Geisser (1971). Our interest, however, is primarily in 0. Defining the nuisance 
parameter to be 7 / = the resulting Fisher information matrix is 


l{0,il)) = diag 


{m + n)0'^ — 0)'^ {m + nY"ip‘^ 
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again of the form (4). So the Jeffreys prior and the one-at-a-time reference prior of 
any ordering for 9 and ip is 7r^{9,ilj) = 1/{0(1 — 0)^/;}, which can be seen to be the 
transformed version of TT^{r]i,r] 2 ) = 1 /( 771772 )- So the Jeffreys prior is also the one-at-a- 
time reference prior for 6. Ghosh and Sun (1998) showed that this prior is the second 
order matching prior for 9 when tti/tt, —>■ a > 0. 

2.2 Other scenarios with a common reference prior 

A common reference prior can exist in scenarios not covered by Theorem 2.1. Two such 
situations are considered here, the first which leads to a fine overall prior and the second 
which does not. 


The location-scale family 

Consider the location-scale family having density 


p{x\fx,a) = -g (^^— 

a \ a / 


where g is a specified density function and 77 € IR and cr > 0 are both unknown. The 
Fisher information of (/i, a) is 


1 ( 77 , ct) 


I{y^-^^+9'iy)}dy \ 
+ 9'{y)}dy J ^ dy ) 


(17) 


Although this is not of the form (4), it is easy to see that the one-at-a-time reference 
prior for either 77 or cr is 71 ^( 77 , cr) = 1/a. This prior is also the right-Haar prior for the 
location-scale group, and known to result in Bayes procedures with optimal frequentist 
properties. Hence it is clearly the recommended overall prior. 


Unnatural parameterizations 

A rather unnatural parameterization for the bivariate normal model arises by dehning 
t/i = l/cri ,'!/2 = l/\/cr|(l — p^), and 7/3 = —pa^/ax. From Berger and Sun (2008), the 
Fisher information matrix for the parameterization (t/i, 7 / 2 , 7 / 3 , 771 , 772 ) is 


I 


2Lih 


2 

2 ’ 



(18) 


where S ^ ^ ^ ■ While this is not of the form (4), direct compu¬ 

tation shows that the one-at-a-time reference prior for any of these hve parameters and 
under any ordering is 


7r^(V’i,V’2,V'3,Mi,M2) = 


(19) 


Unfortunately, this is equivalent to the right Haar prior, 7r^(cri, 172 , 77 , 771 , 772 ) = 
which we have argued is not a good overall prior. This suggests that the parameters 
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used in this ‘common reference prior’ approach need to be natural, in some sense, to 
result in a good overall prior. 

3 Reference distance approach 

Recall that the goal is to identify a single overall prior 7r(u)) that can be systematically 
used for all the parameters 0 = 9(uj) = {0i(a;),..., of interest. The idea of the 

reference distance approach is to find a 7r(ci;) whose corresponding marginal posteriors, 
{TT{9i I a:)}™ ^ are close, in an average sense, to the reference posteriors {7Ti{9i | £c)}™ ^ 
arising from the separate reference priors derived under the assumption 

that each of the 0i’s is of interest. (In situations where reference priors are not unique 
for a parameter of interest, we assume other considerations have been employed to select 
a preferred reference prior.) In the remainder of the paper, 6 will equal u), so we will 
drop u} from the notation. 

We first consider the situation where the problem has an exact solution. 

3.1 Exact solution 

If one is able to find a single joint prior tt{6) whose corresponding marginal posteriors 
are precisely equal to the reference posteriors for each of the 6i’s, so that, for all x G X, 

Tr{9i\x) = 7ri{9i\x), i = (20) 

then it is natural to argue that this should be an appropriate solution to the problem. 
The most important situation in which this will happen is when there is a common 
reference prior for each of the parameters, as discussed in Section 2. It is conceivable 
that there could be more than one overall prior that would satisfy (20); if this were to 
happen it is not clear how to proceed. 

Example 3.1. Univariate normal data. Consider data x which consist of a random 
sample of normal observations, so that p{x \ 6) = p(x \pL,<j) = 0”=! I o')) 
suppose that one is equally interested in p (or any one-to-one transformation of p) 
and a (or any one-to-one transformation of a, such as the variance cr^, or the 
precision cr~^.) The common reference prior when any of these is the quantity of 
interest is known to be the right Haar prior 7r^(/i, cr) = 'k„{p,(t) = cr“^, and this 
is thus an exact solution to the overall prior problem under the reference distance 
approach (as is also clear from Section 2.2, since this is a location-scale family). 

Interestingly, this prior also works well for making joint inferences on {p, a) in 
that it can be verified that the corresponding joint credible regions for {p, a) have 
appropriate coverage properties. This does not mean, of course, that the overall 
prior is necessarily good for any function of the two parameters. For instance, if 
the quantity of interest is the centrality parameter 9 = pja, the reference prior is 
easily found to be TTe{9,a) = (1 -b (Bernardo, 1979), which is not the 

earlier overall reference prior. Finding a good overall prior by the reference distance 
situation when this is added to the list of parameters of interest is considered in 
Section 3.2. 
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3.2 Reference distance solution 

When an exact solution is not possible, it is natural to consider a family of candidate 


prior distributions, J- = {7r(0|a),a G A}, and choose, as the overall prior, the distri¬ 
bution from this class which yields marginal posteriors that are closest, in an average 
sense, to the marginal reference posteriors. 

Directed logarithmic divergence 

It is first necessary to decide how to measure the distance between two distributions. 
We will actually use a divergence, not a distance, namely the directed logarithmic or 
Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) given in the following 
definition. 

Definition 1. Let p{ip) be the probability density of a random vector if G , and con¬ 
sider an approximation Poiip) with the same or larger support. The directed logarithmic 
divergence of po from p is 



provided that the integral exists. 

The non-negative directed logarithmic divergence k{pq \ p} is the expected log-density 
ratio of the true density over its approximation; it is invariant under one-to-one transfor¬ 
mations of the random vector xp' and it has an operative interpretation as the amount of 
information (in natural information units or nits) which may be expected to be required 
to recover p from po- It was first proposed by Stein (1964) as a loss function and, in a 
decision-theoretic context, it is often referred to as the entropy loss. 

Weighted logarithmic loss 

Suppose the relative importance of the 9i is given by a set of weights {wi,..., Wm}, with 
0 < Wi < 1 and Wi = 1. A natural default value for these is obviously Wi = 1/m, but 
there are many situations where this choice may not be appropriate; in Example 1.3 
for instance, one might give 9 considerably more weight than the means pt. To define 
the proposed criterion, we will also need to utilize the reference prior predictives for 
i = 1,..., m. 



Definition 2. The best overall prior within the family T = {7r(0|a),a G A} 

is defined as that—assuming it exists and is unique—which minimizes the weighted 
average expected logarithmic loss, so that 


7r°(0) = 7r(0|a*), a* = arg inf dia)^ 

a^A 
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This can be rewritten, in terms of the sum of expeeted risks, as 


d{a) = 


E 

2 = 1 


Wi 


\e)7reMde, 


I& 


ae A, 


where 

Pi{a\e)= / ^^{■Kg.[■\x,a)\■Kg^{■\x)}p(x\e)dx, 9 g&. 

Jx 

Note that there is no assurance that d{a) will be finite if the reference priors are 
improper. Indeed, in cases we have investigated with improper reference priors, d{a) has 
failed to be finite and hence the reference distance approach cannot be directly used. 
However, as in the construction of reference priors, one can consider an approximating 
sequence of proper priors {TTg. (0 \ k),k = 1, 2...} on increasing compact sets. For each 
of the 710.(0 \k), one can minimize the expected risk 


m p 

d{a\k) = '^Wi / pi{a\e)Tre.{0\k)de, 

i=i 


obtaining = arginfag _4 d{a \ k). Then, if a* = limfe_>oo o-k exists, one can declare this 
to be the solution. 


Multinomial model 

In the multinomial model with m cells and parameters {0i,..., 0m}, with = 1, 

the reference posterior for each of the 9fs is 77^(0^ | a;) = Be(0i | Xi + ^,n—Xi+^), while the 
marginal posterior distribution of 9i resulting from the joint prior Di(0i,..., 0m-i | a) is 
Be(0i \ xi + a,n — Xi + (m — I)a). The directed logarithmic discrepancy of the posterior 
Be(0i \ xi + a,n — Xi + {m—l)a) from the reference posterior Be(0i \ xi-\-\,n —Xi +is 

Ki{a I X, m, n} = Ki{a \xi,m,n} = n^^{xi -\- a,n — Xi-\- {jn — I)a \xi -\- \,n — Xi + \} 


where 


7tBe{Q:o,/3o I a,/3} 


Be(0i I a,/3) log 


• Be(0,1 a,/3) ~ 
-Be(0i I ao,/3o)- 


d9i 


r r(a + /3) r(ao) r(/3o) ' 

’ °®[r(ao+/3o) r(a) r(/3) _ 

+ (a - ao)ip{a) + (/3 - MipiP) - {{a + /3) - (oo + /3o))7/'(a + /?), 


and '(/'(•) is the digamma function. 

The divergence Ki{a \ Xi,m,n} between the two posteriors of 9i depends on the data 
only through Xi and the sampling distribution of Xi is Binomial Bi(a:i | n, 9i), which only 
depends of 0^. Moreover, the marginal reference prior for 0^ is Trg. (0^) = Be(0i 11/2,1/2) 
and, therefore, the corresponding reference predictive for Xi is 


p{xi I n) 


Bi(a:, I n, 0,) Be(0,11/2,1/2) d0, 


1 r(a;i + I) r(n - Xj + ^) 
TT T{xi + 1) r(n — Xi + 1) 


0 
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Figure 1: Expected logarithmic losses, when using a Dirichlet prior with parameter 
{a,..., a}, in a multinomial model with m cells, for sample sizes n = 5,10, 25,100 and 
500. The panels are for m = 10,100, 200 and 1000. In all cases, the optimal value for all 
sample sizes is a* ~ 0.8/m. 


Hence, using Definition 2 with uniform weights, the average expected logarithmic loss 
of using a joint Dirichlet prior with parameter a with a sample of size n is simply 

n 

d{a I TO, n) = Acja | x, to, n}p{x \ n) 

x—Q 

since, by the symmetry of the problem, the to parameters {Oi,..., 0m} yield the same 
expected loss. 
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The function d(a | m = 10, n) is graphed in the upper left panel of Figure 1 for several 
values of n. The expected loss decreases with n and, for any n, the function d(a | m, n) 
is concave, with a unique minimum numerically found to be at a* k.= 0 . 8 /m = 0.08. 
The approximation is rather precise. For instance, the minimum is achieved at 0.083 for 
n= 100. 

Similarly, the function d{a \ m = 1000, n) is graphed in the lower right panel of Figure 1 
for the same values of n and with the same vertical scale, yielding qualitatively similar 
results although, as one may expect, the expected losses are now larger than those 
obtained with m = 10. Once more, the function d{a\m = 1000,n) is concave, with a 
unique minimum numerically found to be at a* 0 . 8 /m = 0.0008, with the exact value 
very close. For instance, for n = 100, the minimum is achieved at 0.00076. 

If can be concluded that, for all practical purposes when using the reference dis¬ 
tance approach, the best global Dirichlet prior, when one is interested in all the pa¬ 
rameters of a multinomial model, is that with parameter vector { 1 /m,..., 1 /m} (or 
0.8 X {1/m,..., 1/m} to be slightly more precise), yielding an approximate marginal 
reference posterior for each of the 9ds as Be{9i \ Xi + 1 /m, n — Xi + {m — l)/m), having 
mean and variance 

B[9i \ xi,n] =9i = {xi + l/m)/(n-|- 1), Yar[9i \xi,n] = 9i{l - 9i)/{n + 2). 


The normal model with coefficient of variation 

Consider a random sample 2 : = {xi ,..., x„} from a normal model N(x | /r, a), with both 
parameters unknown, and suppose that one is interested in ^ and a, but also in the 
standardized mean </ = /i/cr (and/or any one-to-one function of them such as logcr, or 
the coefficient of variation cr//i). 

The joint reference prior when either fj, or cr are the quantities of interest is 

7r^(/i, cr) = 7r<^(/r,cr) = (21) 

and this is known to lead to the Student and squared root Gamma reference posteriors 

I z) = St{fi I X, s/y/n — 1, n — 1), 7r^'^'^(cr | z) = Ga“^^^(CT | (n — l)/2, ns'^12), 

with nx = X]r=i ~ x)'^, which are proper if n > 2, and have the 

correct probability matching properties. However, the reference prior if </> is the param¬ 
eter of interest is Tr,p{(j), a) = {2 + (Bernardo, 1979), and the corresponding 

reference posterior distribution for (/ can be shown to be 

oc (2-k(/^)"^/^p(t|(/), 

where t = (X^ti ^j)/(S"=i ^ sampling distribution p{t \ (p) depending only 

on (j) (see Stone and Dawid, 1972). Note that all posteriors can be written in terms of 
the sufhcient statistics x and and the sample size n, which we will henceforth use. 

A natural choice for the family of joint priors to be considered as candidates for an 
overall prior is the class of relatively invariant priors (Hartigan, 1964), 


J- = {7r(/i, cr I a) = cr “, a > 0} 
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which contains, for a = 1, the joint reference prior (21) when either /i or cr are the 
parameters of interest, and the Jeffreys-rule prior, for a = 2. Since these priors are 
improper, a compact approximation procedure, as described at the end of Section 3.2, 
is needed. The usual compactification for location-scale parameters considers the sets 

Cfc = {^ G (-/c, k), a G e'")}, fe = 1,2,.... 


One must therefore derive 


d{a \ n,k) = d^(a \ n^k) + da-{a \ n,k) + d^{a \ n, k), 


where each of the dds is found by integrating the corresponding risk with the appropri¬ 
ately renormalized joint reference prior. Thus, 


d^,{a\n,k) = K{7r^(-1 n, t, a) 17r(;'=^(-1 n, i)}p(t | n, /r, cr) 

JCk Ut 

da{a\n,k) = /c{7rcr(- 1 n, t, a) 11 n, t)}p(t | n,/c, tr) 

JCk Ut 

d^{a\n,k) = K{Tr^{-\n,t,a) \ {■ \n,t)} p{t\n, dt 

JCk Jt 


7 r^(^, a\k)dpL dcr. 


Tra-ifJ:, a\k) dp da, 


TT^{p, a\k)dp da, 


where t = {x,s), and the TVi{p,a \ fe)’s are the joint proper prior reference densities of 
each of the parameter functions obtained by truncation and renormalization in the Cfc’s. 

It is found that the risk associated to p (the expected KL divergence of 7r^(- | n,t,a) 
from 1 n,t) under sampling) does not depend on the parameters, so integration 

with the joint prior is not required, and one obtains 


d^{a I n) 


r[n/2]r[(a-bn)/2-1] 
r[(n- l)/2]r[(a-bn- l)/2]_ 




where '0[-] is the digamma function. This is a concave function with a unique minimum 
di(l |n) = 0 at a = 1, as one would expect from the fact that the target family P 
contains the reference prior for p when a = 1. The function dfi{a\n = 10) is the lower 
dotted line in Figure 2. Similarly, the risk associated to cr does not depend either of the 
parameters, and one obtains 


d^ia \n,k) = do-(a | n) 


r[(a-bn)/ 2 - 1 ] 
r[(n-l)/2] 




another concave function with a unique minimum ^ 2 ( 1 1 n) = 0, at a = 1. The function 
dcr(a I n = 10) is the upper dotted line in Figure 2. 

The risk associated with (j) cannot be analytically obtained and is numerically com¬ 
puted, using one-dimensional numerical integration over (j) to obtain the KL divergence, 
and Monte Carlo sampling to obtain its expected value with the truncated and renor¬ 
malized reference prior 'K^{p, a\k). The function drf,{a \ n = 10, fc = 3) is represented by 
the black line in Figure 2. It may be appreciated that, of the three components of the 
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Figure 2: Expected average intrinsic logarithmic losses d{a \ n, k) associated with the 
use of the joint prior 7r(^, cr | a) = cr““ rather than the corresponding reference priors 
when n = 10 and fc = 3. 



Figure 3: Reference posterior (solid) and marginal overall posterior (black) for (/) given 
a minimal random sample of size n = 2. The dotted line is the marginal posterior for 
the prior with a = 2, which is the Jeffreys-rule prior. 


expected loss, the contribution corresponding to (p is the largest, and that corresponding 
to ^ is the smallest, in the neighborhood of the optimal choice of a. The sum of the 
three is the expected loss to be minimized, d{a \n,k). The function d{a \ n = 10, fc = 3) 
is represented by the solid line in Figure 2, and has a minimum at a| = 1.016. The 
sequence of numerically computed optimum values is {a^.} = {1.139,1.077,1.016,...} 
quickly converging to some value a* larger than 1 and smaller than 1.016, so that, prag¬ 
matically, the overall objective prior may be taken to be the usual objective prior for 
the normal model, 

7 r°(^,cr) = cr"^ 

It is of interest to study the difference in use of this overall prior when compared 
with the reference prior for (j) = gilcr. The difference is greater for smaller samples, and 
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the minimum sample size here is n = 2. A random sample of two observations from 
N(a; 11, i) (so that the true value of the standardized mean is </> = 2) was simulated 
yielding = {0.959,1.341}. The corresponding reference posterior for (f> is the 

solid line in Figure 3. The posterior that corresponds to the recommended overall prior 
a = 1 is the black line in the figure. For comparison, the posterior corresponding to the 
prior with a = 2, which is Jeffreys-rule prior, is also given, as the dotted line. Thus, 
even with a minimum sample size, the overall prior yields a marginal posterior for tp 
which is quite close to that for the reference posterior. (This was true for essentially all 
samples of size n = 2 that we tried.) For sample sizes beyond n = 4 the differences are 
visually inappreciable. 


4 Hierarchical approach with hyperpriors 

If a natural family of proper priors 7 r (0 | a), indexed by a single parameter a, can be 
identified for a given problem, one can compute the marginal likelihood p(x | a) (nec¬ 
essarily a proper density), and find the reference prior 7r^(a) for a for this marginal 
likelihood. This hierarchical prior specification is clearly equivalent to use of 

7r°(0) = J 7 r(0 I a) -K^{a)da 

as the overall prior in the original problem. 


4.1 Multinomial problem 

The hierarchical prior 

For the multinomial problem with the Di(0 | a,..., a) prior, the marginal density of any 
of the XiS is 


p{xi I a,m,n) 


f n \ T(xi + a) r(n — Xi + (m — l)a) F(ma) 
r(a) r((m — l)a) r(n -I-m a) 


following immediately from the fact that, marginally, 


p{xi I Oi) = Bi(a;i | n, 9i) 7r(6»i | a) = Be(0i | a, (m - l)a). 

Then 7r^(a), the reference (Jeffreys) prior for the integrated model p{x\a) in (3), is 
given in the following proposition: 

Proposition 4.1. 

1/2 


where Q{- \ a, m, n) is the right tail of the distribution of p(x \ a, m, n), namely 

n 

Q{j\a,m,n)= p{l\a,m,n), j = 0,...,n-l. 

i=j+i 


Tr^{a I m, n) oc 


n—1 

E 

j=0 


Qjj I a,rn,n) 

(a-f j)2 


m 


-jy 


(22) 
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Proof. Computation yields that 


E 


where Si'^ce the Xi are exchangeable, this equals 


-^logp(a:|a) 


n—1 


= -E 


3=0 


(ma + j)2 


■E 


m Xi — 1 

EE 

i=l i=0 


{a + jY 


(23) 


■'3=0 


n — 1 


-E 

3=0 


(m a + jY 


mE 


,Xi 


Jti-i 

E 

3=0 


{a + jY 


and the result follows by rearranging terms. 
Proposition 4.2. 7r^(a) is a proper prior. 


□ 


Proof. The prior is clearly continuous in a, so we only need show that it is integrable 
at 0 and at oo. Consider first the situation as a —>■ oo. Then 


p(0|a,m,n) = 


r(a)r(n + [m— l]a)r(TO a) 
r(a)r([m — l]a)r(n + ma) 

(m — l)a[(m — l)a + 1] • • • [{m — l)a + n — 1] 


(to — 1) 


TO a{m a + 1 ) • • • (to a + n — 1) 
(1 - c„a + 0(a^)), 


where c„ = 1 + 1/2 + • • • + l/(n — 1). Thus the first term of the sum in (22) is 

1 -p(0 I a,m,n) _]_ ^ (to - 1)c„ 

ma? ma 

All of the other terms of the sum in (22) are clearly 0(1), so that 


Rf N _ \/(to- 1)c„/to 


7r"(a) = 


■/a 


O(v^), 


as a —>■ 0, which is integrable at zero (although unbounded). 

To study propriety as a —>• oo, a laborious application of Stirling’s approximation 
yields 

p{xi I a, TO, n) = Bi(xi | n, 1 /to)(1 + 0{a~^)), 

as a —>■ oo. Thus 

-1 1/2 

R, ^ ^ 1 ^ _3, 

-K^{a,m,n) = ^ ^^- ZTxi \ + ^(^ ) 


3=0 


TO 


ZBi(l I n, 1 /to) n 


TO 

which is integrable at infinity, completing the proof. 


+ 0 (a-") 


1/2 


= 0(a-3/2), 


□ 
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As suggested by the proof above, the reference prior n^{a \ m, n) behaves as 
near a = 0 and behaves as 0(a~^) for large a values. Using series expansions, it is 
found that, for sparse tables where m/n is relatively large, the reference prior is well 
approximated by the proper prior 

TT*{a\m,n) = ^ , (24) 

2 m \ m/ 

which only depends on the ratio m/n, and has the behavior at the extremes described 
above. This can be restated as saying that 4>{a) = a/{a+{n/m)) has a Beta distribution 
Be(^|i,l). Figure 4 gives the exact form of 7r^(a|m,n) for various {m,n) values, 
and the corresponding approximation given by (24). The approximate reference prior 
7r*(a I m, n) appears to be a good approximation to the actual reference prior, and hence 
can be recommended for use with large sparse contingency tables. 



0.0 0.5 1.0 1.5 2.0 

Figure 4: Reference priors 7r^(a|m, n) (solid lines) and their approximations (dotted 
lines) for (m = 150,n = 10) (upper curve) and for (m = 500,n = 10) (lower curve). 

It is always a surprise when a reference prior turns out to be proper, and this seems to 
happen when the likelihood does not go to zero at a limit. Indeed, it is straightforward 
to show that 

f as a —0, 

I “) = 1 

I U ’ as a oo, 

where tq is the number of nonzero Xi. Thus, indeed, the likelihood is constant at oo, so 
that the prior must be proper at infinity for the posterior to exist. 

Computation with the hierarchical reference prior 

If a full Bayesian analysis is desired, the obvious MCMC sampler is as follows: 

Step 1. Use a Metropolis Hastings move to sample from the marginal posterior 
7 r^(a I x) oc ■K^[a)p{x \ a). 

Step 2. Given a, sample from the usual beta posterior 7r(d | a, x). 
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This will be highly efficient if a good proposal distribution for Step 1 can be found. As 
it is only a one-dimensional distribution, standard techniques should work well. Even 
simpler computationally is the use of the approximate reference prior 7r*(a|m, n) in 
(24), because of the following result. 

Proposition 4.3. Under the approximate reference prior (24), and provided there are 
at least three nonempty cells, the marginal posterior distribution of a is log-concave. 


Proof. It follows from (23) that 
^ log[p(a; I a)7r*(a | m, n)] = ^ ■ 


m Xi — 1 


-T.T. 


.^^(ma + j)^ ^^(a-fj)2 20 ^ 2{a + n/m)^ 


Without loss of generality, we assume that > 0, for f = 1, 2, 3. Then 

■2 3 Xi-l 


da? 


log[p(a: I a)p*{a\m,n)] < -EE 


1 


+ < 0. 


i=2 j=0 


(«+j)^ 20^ 20^ 


□ 


Thus adaptive rejection sampling (Gilks and Wild, 1992) can be used to sample from 
the posterior of a. 

Alternatively, one might consider the empirical Bayes solution of fixing a at its poste¬ 
rior mode d^. The one caveat is that, when cq = 1, it follows from (25) that the likelihood 
is constant at zero, while 7r^(a) is unbounded at zero; hence the posterior mode will be 
a = 0, which cannot be used. When tq > 2, it is easy to see that ■n^{a)p{x \ a) goes to 
zero as a —^ 0, so there will be no problem. 

It will typically be considerably better to utilize the posterior mode than the maxi¬ 
mum of p(x I a) alone, given the fact that the likelihood does not go to zero at oo. For 
instance, if all = I, it can be shown that p(x | a) has a likelihood increasing in a, so 
that there is no mode. (Even when tq = I, use of the mode of p(x | a) is not superior, 
in that the likelihood is also maximized at 0 in that case.) 


Posterior behavior as m —)• cjo 

Since we are contemplating the “large sparse” contingency table scenario, it is of con¬ 
siderable interest to study the behavior of the posterior distribution as m —>■ oo. It is 
easiest to state the result in terms of the transformed variable v = ma. Let 7r,^(x | x) 
denote the transformed reference posterior. 

Proposition 4.4. 


4>(ri) = lim 7r,^(x|a;) 

m—>-oo 


yUo-i) 

r(ri -I- n) 


E 


i 


1/2 


{v iy 


(25) 


Proof. Note that 


TT^{a\x) oc TO(a; I a) 7r‘^(a) 
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T(m a + n) 
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r(a + Xi) 
r(a) 

.1 — 1 ^ ' 
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Ha) 


a(a + 1) s{a + Xi — 1) 

i:Xi^0 


TT^ia) 


n — 1 


n(«+jT 

j=0 


Ha), 


where rj = {4fxi > j}. Change of variables to v = ma yields 




(«l®) 


OC 


m 

r(n + n) 

r(w) 

r(n + n) 


n—1 

n(; 

j=0 


+ j 


c+ Y. K, 


2=1 


(26) 


where C = Y\H =2 ^^e Ki are constants. 

Next we study the behavior of TT^{v/m) for large m. Note first that, in terms of v, 
the marginal density of xi = 0 is 


p(0 |u) = 


+ r(u) 


rfiiiiiiili;') r(u + n) 

^ m ‘ 

(m —1) r(m —1) . i r(m —1) . 11 

- -V + 1 • • • \ - - -V + n — 1 

m L 771, J L m J 
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2 \ ' 
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Hence 


( 5(0 I a) = l-p( 0 |w) 




m ^—' V + i 

i—1 


m?{v + 1 )^ 

It follows that all Q{i\a) < 0{l/m), so that tt^ im) proportional to 

i=l ~ j^i \m 


= O \ — (uniformly in v) . 
' m, 
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( (w 

-1) 

m 


2^ V + i [ 

.i—1 ^ 

V 
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^ {v 
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+ *)2 

= \/m — 1 

'.. n—1 

E- 
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i 
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2 + 0 ( 


-I 1/2 
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Combining this with (26), noting that fr(i;) = r(ti + 1), and letting m —>■ oo, yields the 
result. □ 


It follows, of course, that a behaves like v/m for large m, where v has the distribution 
in (25). It is very interesting that this “large m” behavior of the posterior depends on 
the data only through tq, the number of nonzero cell observations. 

If, in addition, n is moderately large (but much smaller than m), we can explicitly 
study the behavior of the posterior mode of a. 

Proposition 4.5. Suppose m —>• oo, n —>• oo, and n/m ^ 0. Then (25) has mode 


V ~ 


(r-o-l.S) 

log(l+n/ro) 


zfr^^c<l, 


where tq is the number of nonzero Xi, c* is the solution to c* log(I + ^) = c, and 
f{n, m) ~ g{n, m) means f(n, m) jg{n, m) —>■ I. The corresponding mode of the reference 
posterior for a is d^ = v/m. 


Proof. Taking the log of (25) and differentiating with respect to v results in 
vi/'(t,) = (^0 ~ ^-5) _ 1 Si=i (v+iy 


^ v + i 

2 = 1 Z^l — 1 


Jv+Tp' 

Note first that, as n grows, and if v also grows (no faster than n), then 
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dx + O 
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(i; + 1) 
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- 1 > log 2, 


V +1 n 


again using that v will not grow faster than n. Putting these together we have that 


_ log 


V + n 

V -\-l 


O 


V +1 


o\- 

n 


Case 1. ^ ^ c, for 0 < c < 1. For this case, write v = c*n/{l + 6) for <5 small, and 
note that then 


= —(1 + o(l))(l + (5) - log 


(c* + l) 


+ 0 ( 1 ). 


Since ^ — log j = 0, it is clear that S can be appropriately chosen as o(l) to 

make the derivative zero. 


Case 2. ^ ^ 0. Now choose v = n follows 

that 


log ^1 + —^ = (logn - log To + o(l))(l + 6) and 
log = [log^ “ log(^ + 1)](1 + o(l)) ■ 

Consider first the case v ^ oo. Then 


log(z; + 1) = (1 + o(l))(logro - loglog(l + n/ro)) = (1 + o(l))logro , 


so that 


= (logn - logro + o(l))(l + d) - (logn - logro)(l + o(l)) + o(l), 

and it is clear that 6 can again be chosen o(l) to make this zero. Lastly, if v < K < oo, 
then (logTo)/(logn) = o(l), so that '!''(?;) = (logn)(l + o(l))(l + <5) — (logn)(l + o(l)) + 
o(l), and 6 can again be chosen o(l) to make this zero, completing the proof. □ 

Table 1 gives the limiting behavior of v for various behaviors of the number of nonzero 
cells, tq. Only when rp = logn does the posterior mode of a (i.e., v/m) equal l/m, 
the value selected by the reference distance method. Of course, this is not surprising; 
empirical Bayes is using a fit to the data to help select a whereas the reference distance 
method is pre-experimental. 
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’’0 

cn (0 < c < 1) 

n" (0 < 6 < 1) 

(log n)" 

logn 

0(1) 

V 

c*n 

(l — b) log n 

(logn)(^“^) 

1 

0(1/logn) 


Table 1: The limiting behavior of D as n —>• oo, for various limiting behaviors of vq, the 
number of non-zero cells. 


4.2 Multivariate hypergeometric model 

Let A/+ be the set of all nonnegative integers. Consider a multivariate hypergeometric 
distribution Hy;.(rfc | n, Rk, N) with the probability mass function 


llyk{rk\n,Rk,N) 



Vk € 'Rk,nj 


(27) 


'R-k,n = {rk = {ri,--- ,Tk)-, rj€M+, ri H-h < n}, 

where the k unknown parameters Rk = (i?i, • • • , Rk) are in the parameter space TZk,N- 
Here and in the following, Rk+i = N — (i?i -I- • • • -I- Rk)- Notice that the univariate 
hypergeometric distribution is the special case when k = 1. 

A natural hierarchical model for the unknown Rk is to assume that it is multinomial 

Muk{Rk I N,pk), withpfc e Vk = {Pk = {pi, ■ ■ ■ ,Pk)}, 0<P]< 1, and Pi H- \-pk < 1. 

The probability mass function of Rk is then 


/VI 

MukiRk I N, Pk) = n pf' ■ 

llj=i Rj'- 1=1 

Berger, Bernardo and Sun (2012) prove that the marginal likelihood of Vk depends only 
on (n,pk) and it is given by 

p{rk\pk,n,N) = p{rk\pk,n)= ^ Byk{'<'k\n,Rk,N)Mnk{Rk\N,pk) 

Rk 

= Mufc(rfc I n,pfc), Tfe e (28) 

This reduces to the multinomial problem. Hence, the overall (approximate) reference 
prior for [Rk \ N,pk) would be Multinomial-Dirichlet Di(i2fc | 1/fe, • • • , l/k). 

4.3 Multi-normal means 

Let Xi be independent normal with mean pi and variance 1, for i = 1 ■ ■ ■ ,m. We are 
interested in all the pi and in\p\^ = p\-\ --|- p^. 

The natural hierarchical prior modeling approach is to assume that pi N(/ii | 0, r). 
Then, marginally, the Xi are iid N(a;i | 0, Vl -|- r^) and the reference (Jeffreys) prior 
for in this marginal model is 


7r'^(r^) oc (1 -I- T^) 
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The hierarchical prior for (and recommended overall prior) is then 



(29) 


This prior is arguably reasonable from a marginal reference prior perspective. For the 
individual fXi , it is a shrinkage prior known to result in Stein-like shrinkage estimates of 
the form 



with r(-) Ki p for large arguments. Such shrinkage estimates are often viewed as actually 
being superior to the reference posterior mean, which is just Xi itself. The reference 
prior when |^| is the parameter of interest is 



(30) 


which is similar to (29) in that, for large values of \fi\, the tails differ by only one power. 


Thus the hierarchical prior appears to be quite satisfactory in terms of its marginal 
posterior behavior for any of the parameters of interest. Of course, the same could be 
said for the single reference prior in (30); thus here is a case where one of the reference 
priors would be fine for all parameters of interest, and averaging among reference priors 
would not work. 

Computation with the reference prior in (30) can be done by a simple Gibbs sampler. 
Computation with the hierarchical prior in (29) is almost as simple, with the Gibbs step 
for being replaced by the rejection step: 

Step 1. Propose from the inverse gamma density proportional to 



Step 2. Accept the result with probability t^/(t^ -|- 1) (or else propose again). 

4.4 Bivariate normal problem 

Earlier for the bivariate normal problem, we only considered the two right-Haar priors. 
More generally, there is a continuum of right-Haar priors given as follows. Define an 
orthogonal matrix by 



where —7r/2 < /3 < 7r/2. Then it is straightforward to see that the right-Haar prior 
based on the transformed data FX is 


7r(/ri,//2,CTi,cr2,p|/3) 
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We thus have a class of priors indexed by a hyperparameter j3, and it might be 
tempting to try the hierarchical approach even though the class of priors is not a class 
of proper priors and hence there is no proper marginal distribution to utilize in finding 
the hyperprior for /3. The temptation here arises because /3 is in a compact set and it 
seems natural to use the (proper) uniform distribution (being uniform over the set of 
rotations is natural.) The resulting joint prior is 

1 

7r°(^i,/i2,cri,cr2,p) = - / 7r(/Xi,/r2,cri,cr2,p|/3)d/3, 

J-7r/2 

which equals the prior in (1), since 

/ 7r/2 

sin(/3) cos(/3)d/3 = 0, / sin^(/3)d/3 = / cos^(/3)(i/3 = constant. 

-7r/2 J-Tr/2 J-Tr/2 

Thus the overall prior obtained by the hierarchical approach is the same prior as ob¬ 
tained by just averaging the two reference priors. It was stated there that this prior is 
inferior as an overall prior to either reference prior individually, so the attempt to apply 
the hierarchical approach to a class of improper priors has failed. 

Empirical hierarchical approach: Instead of integrating out over /3, one could find 
the empirical Bayes estimate /3 and use 7r(^i, fj,2, Ui, 02, p \ $) as the overall prior. This 
was shown in Sun and Berger (2007) to result in a terrible overall prior, much worse 
than either the individual reference priors, or even -k^ in (1). 

5 Discussion 

When every parameter of a model has the same reference prior, this prior is very natural 
to use as the overall prior. A number of such scenarios were catalogued in Section 2. 
This common reference prior can depend on the parameterization chosen for the models 
(although it will be invariant to coordinatewise one-to-one-transformations). Indeed, an 
example was given in which a strange choice of model parameterization resulted in an 
inadequate common reference prior. 

The reference distance approach to developing an overall prior is natural, and seems 
to work well when the reference priors themselves are proper. It also appears to be pos¬ 
sible to implement the approach in the case where the reference priors are improper, by 
operating on suitable large compact sets and showing that the result is not sensitive to 
the choice of compact set. Of course, the approach is dependent on the parameterization 
used for the model and on having accepted reference priors available for all the param¬ 
eters in the model; it would have been more satisfying if the overall prior depended 
only on the model itself. The answer will also typically depend on weights used for the 
individual reference priors, although this can be viewed as a positive in allowing more 
important parameters to have more influence. The implementation considered in this 
paper also utilized a class of candidate priors, with the purpose of finding the candidate 
which minimized the expected risk. The result will thus depend on the choice of the 
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candidate class although, in principle, one could consider the class of all priors as the 
candidate class; the resulting minimization problem would be formidable, however. 

The hierarchical approach seems excellent (as usual), and can certainly be recom¬ 
mended if one can find a natural hierarchical structure based on a class of proper priors. 
Such hierarchical structures naturally occur in settings where parameters can be viewed 
as exchangeable random variables but may not be available otherwise. In the particu¬ 
lar examples considered, the overall prior obtained for the multi-normal mean problem 
seems fine, and the recommended hierarchical prior for the contingency table situation 
is very interesting, and seems to have interesting adaptations to sparsity; the same can 
be said for its empirical Bayes implementation. In contrast, the attempted application 
of the hierarchical and empirical Bayes idea to the bivariate normal problem using the 
class of right-Haar priors was highly unsatisfactory, even though the hyperprior was 
proper. This is a clear warning that the hierarchical or empirical Bayes approach should 
be based on an initial class of proper priors. 

The failure of arithmetic prior averaging in the bivariate normal problem was also 
dramatic; the initial averaging of two right-Haar priors gave an inferior result, which 
was duplicated by the continuous average over all right-Haar priors. Curiously in this 
example, the geometric average of the two right-Haar improper priors seems to be 
reasonable, suggesting that, if averaging of improper priors is to be done, the geometric 
average should be used. 

The ‘common reference prior’ and ‘reference distance’ approaches will give the same 
answer when a common reference prior exists. However, the reference distance and hier¬ 
archical approaches will rarely give the same answer because, even if the initial class of 
candidate priors is the same, the reference distance approach will fix the hyperparam¬ 
eter a, while the hierarchical approach will assign it a reference prior; and, even if the 
empirical Bayes version of the hierarchical approach is used, the resulting estimate of a 
can be different than that obtained from the reference distance approach, as indicated 
in the multinomial example at the end of Section 4.1. 

The ‘common reference prior’ and hierarchical approaches will mostly have different 
domains of applicability and are the recommended approaches when they can be applied. 
The reference distance approach will be of primary utility in situations such as the 
coefficient of variation example in Section 3.2, where there is no natural hierarchical 
structure to utilize nor common reference prior available. 
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